A linear bound on the number of states in optimal
convex characters for maximum parsimony distance


Olivier Boes, Mareike Fischer, Steven Kelk


Abstract 

Given two phylogenetic trees on the same set of taxa $X$, the maximum parsimony
distance $d_{MP}$ is defined as the maximum, ranging over all characters $\chi$ on $X$, of
the absolute difference in parsimony score induced by $\chi$ on the two trees. In this
note we prove that for binary trees there exists a character achieving this maximum
that is convex on one of the trees (i.e. the parsimony score induced on that tree is
equal to the number of states in the character minus 1) and such that the number of
states in the character is at most $7 d_{MP} - 5$. This is the first non-trivial bound on the
number of states required by optimal characters, convex or otherwise. The result
potentially has algorithmic significance because, unlike general characters, convex
characters with a bounded number of states can be enumerated in polynomial time.


1 Introduction 

When phylogenetic trees are inferred from different genes or with different methods, the
outcome is often a set of topologically distinct trees, even when the underlying set of species
is the same pQ. It is natural to ask how different these trees really are, which is why
different metrics on phylogenetic trees have been suggested [2]. To name just a few, there
is for example the Robinson-Foulds distance [3], as well as tree rearrangement metrics like
the SPR distance or the TBR distance [1]. Recently, another metric has been proposed:
maximum parsimony distance $d_{MP}$ [5, 8], which is a lower bound on TBR distance (and
thus also SPR distance). Informally this metric consists of finding a character with a
low parsimony score on one of the trees and a high parsimony score on the other, i.e. it
seeks a character that, from a parsimony perspective, distinguishes the most between the
two trees. Although the metric is based on the parsimony score of a tree, which can be
computed in polynomial time using e.g. Fitch's algorithm [7], the metric itself is (like
SPR and TBR distance) NP-hard to compute, even on binary trees [5, 8]. The metric
also seems extremely difficult to compute in practice, with exact algorithms based on
Integer Linear Programming (ILP) currently limited to trees with 15-20 leaves |H].

In [5, 8] it has been shown that, with a view towards developing more efficient
exponential-time algorithms, the search for optimal characters can be restricted to
characters which are convex (equivalently, homoplasy-free [6]) on one of the two trees under
investigation, i.e. the parsimony score on that tree is the number of states in the character
minus 1. This immediately yields a trivial algorithm with running time $O(4^n \cdot \mathrm{poly}(n))$,
where $n$ is the number of leaves in the trees: guess which tree is convex, and then guess the
subset of the $O(2n)$ edges in this convex tree where mutations occur. This leads naturally
to the question: if $d_{MP}$ is bounded (i.e. “small”), is it sufficient to restrict our search to
convex characters with a bounded number of states (i.e. to locating bounded-size subsets
of mutation edges in the convex tree), irrespective of the number of leaves $n$ in the trees?
Such questions are pertinent to the development of fixed-parameter tractable algorithms,
i.e. algorithms that run quickly on trees with a large number of leaves as long as the
distance is small (see e.g. [9] for related discussions). Prior to this note the best bound on
the number of states required was $\lfloor n/2 \rfloor$ [5, 8]. Here we show that the number of states
required can indeed be decoupled from $n$. In particular we show that optimal convex
characters exist with at most $7 d_{MP} - 5$ states, which is sharp for $d_{MP} = 1$.

We conclude with a discussion of the rather subtle complexity consequences of this 
result, and whether there is room to tighten the bound further. 


2 Preliminaries 

An unrooted binary phylogenetic $X$-tree $T$ is a tree with only vertices of degree 1 (leaves)
or 3 (inner vertices) such that the leaves are bijectively labeled by some finite label set
$X$ (where $X$ is often called the set of taxa). For brevity, such a tree will simply be called
an $X$-tree in the following. A character on $X$ is a surjective map $\chi : X \to C$ where $C$ is a set
of character states; the number of distinct states in the character is denoted by $|\chi|$. An
extension of a character $\chi$ to a whole $X$-tree $T$ is a map $\bar\chi : V(T) \to C$ such that
$\bar\chi(x) = \chi(x)$ for all $x \in X$. A mutation induced by $\bar\chi$ in $T$ is an edge $\{u,v\} \in E(T)$
satisfying $\bar\chi(u) \neq \bar\chi(v)$, and we write $\Delta(T, \bar\chi)$ for the set of all mutation edges.
The extension $\bar\chi$ is said to be most parsimonious if it achieves the minimum number of
mutations over all possible extensions to $T$ of the character $\chi$. This leads naturally to the
definition of parsimony score.

Definition 2.1. Let $T$ be any $X$-tree and let $\chi$ be any character on $X$.
Then the parsimony score of $\chi$ on $T$ is

$$\ell(T,\chi) := \min_{\bar\chi} |\Delta(T,\bar\chi)| = \min_{\bar\chi} \bigl|\{\, \{u,v\} \in E(T) \mid \bar\chi(u) \neq \bar\chi(v) \,\}\bigr|$$

where the minimum is taken over all possible extensions $\bar\chi$ of the character $\chi$ to $T$.
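As an illustration of Definition 2.1, the following is a minimal sketch that computes the parsimony score by dynamic programming over the tree (a Sankoff-style recursion with unit mutation costs, swapped in here instead of Fitch's algorithm because it also handles non-binary trees). The adjacency-dictionary representation, the function name `parsimony_score` and the quartet example are assumptions made for illustration and do not come from the paper.

```python
def parsimony_score(adj, chi):
    """Return the minimum number of mutation edges over all extensions of chi.

    adj: dict mapping every vertex of an unrooted tree to a list of neighbours.
    chi: dict mapping every leaf (taxon) to its character state.
    """
    states = set(chi.values())
    root = next(iter(adj))
    # Breadth-first traversal: 'order' lists each parent before its children.
    parent = {root: None}
    order = [root]
    for v in order:
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)
    inf = float("inf")
    cost = {}  # cost[v][s] = fewest mutations below v when v receives state s
    for v in reversed(order):
        if v in chi:  # a taxon must keep its observed state
            cost_v = {s: (0 if s == chi[v] else inf) for s in states}
        else:
            cost_v = {s: 0 for s in states}
        for w in adj[v]:
            if parent[w] == v:
                cheapest = min(cost[w].values())
                for s in states:
                    # either the child also takes state s, or one mutation is paid
                    cost_v[s] += min(cost[w][s], cheapest + 1)
        cost[v] = cost_v
    return min(cost[root].values())


# Hypothetical quartet tree with cherries {a, b} and {c, d}.
quartet = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["u", "c", "d"],
}
convex = {"a": 0, "b": 0, "c": 1, "d": 1}
non_convex = {"a": 0, "b": 1, "c": 0, "d": 1}
print(parsimony_score(quartet, convex), parsimony_score(quartet, non_convex))
```

On this quartet the first character needs a single mutation (on the central edge), while the second needs two.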

It is well-known that $\ell(T,\chi) \ge |\chi| - 1$. When a character $\chi$ attains this minimum, i.e.
$\ell(T,\chi) = |\chi| - 1$, then $\chi$ is said to be a convex character on $T$. Some authors follow
a slightly different (but equivalent) path, by defining the homoplasy score $h(T,\chi) :=
\ell(T,\chi) - |\chi| + 1$ of a character $\chi$ on $T$ [6]. In this terminology, we have $h(T,\chi) \ge 0$,
and a character $\chi$ attaining the minimum $h(T,\chi) = 0$ is said to be homoplasy-free (with
respect to $T$). Clearly, a character is convex if and only if it is homoplasy-free.

Although characters are defined on a set $X$ of taxa, this set of taxa will often be left
implicit, allowing us to speak of a character on an $X$-tree. We now use the parsimony
score to define a distance function on pairs of $X$-trees.

Definition 2.2. Let $(T_1,T_2)$ be a pair of $X$-trees.
Then the maximum parsimony distance between $T_1$ and $T_2$ is

$$d_{MP}(T_1,T_2) := \max_{\chi} \bigl|\ell(T_1,\chi) - \ell(T_2,\chi)\bigr|$$

where the maximum is taken over all possible characters $\chi$ on $X$.
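Definition 2.2 can be evaluated directly on very small instances. The sketch below enumerates every character up to renaming of states (one per set partition of the taxa) and takes the largest score difference. It is exponential in the number of taxa and is meant only to make the definition concrete; the helper names and the two quartet trees are hypothetical.

```python
def parsimony_score(adj, chi):
    """Unit-cost small parsimony on an unrooted tree given as an adjacency dict."""
    states = set(chi.values())
    root = next(iter(adj))
    parent, order = {root: None}, [root]
    for v in order:  # breadth-first: parents appear before children
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)
    inf = float("inf")
    cost = {}
    for v in reversed(order):
        cost_v = {s: (0 if v not in chi or chi[v] == s else inf) for s in states}
        for w in adj[v]:
            if parent[w] == v:
                cheapest = min(cost[w].values())
                for s in states:
                    cost_v[s] += min(cost[w][s], cheapest + 1)
        cost[v] = cost_v
    return min(cost[root].values())


def all_characters(taxa):
    """Yield one character per set partition of taxa (restricted growth strings)."""
    def extend(labels, used):
        if len(labels) == len(taxa):
            yield dict(zip(taxa, labels))
            return
        for s in range(used + 1):  # reuse an old state or open exactly one new one
            yield from extend(labels + [s], max(used, s + 1))
    yield from extend([], 0)


def mp_distance(adj1, adj2, taxa):
    """Brute-force maximum parsimony distance; only feasible for a handful of taxa."""
    return max(
        abs(parsimony_score(adj1, chi) - parsimony_score(adj2, chi))
        for chi in all_characters(taxa)
    )


taxa = ["a", "b", "c", "d"]
tree_ab_cd = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
              "u": ["a", "b", "v"], "v": ["u", "c", "d"]}
tree_ac_bd = {"a": ["u"], "c": ["u"], "b": ["v"], "d": ["v"],
              "u": ["a", "c", "v"], "v": ["u", "b", "d"]}
print(mp_distance(tree_ab_cd, tree_ac_bd, taxa))
```

For the two quartets the maximum is attained by a two-state character that is convex on one tree and not on the other.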

It is known that $d_{MP}$ is a metric on unrooted phylogenetic trees [5], hence we call it
a distance. However it is not a metric on rooted phylogenetic trees, because then we lose
identity of indiscernibles (i.e. we only get a pseudometric).

A character $\chi$ on a set $X$ of taxa is said to achieve distance $k$ on a pair $(T_1,T_2)$ of
$X$-trees when $|\ell(T_1,\chi) - \ell(T_2,\chi)| = k$. If this character achieves distance $d_{MP}(T_1,T_2)$,
then we say that $\chi$ is an optimal character for this pair of trees.

An optimal character for a pair of trees which has the additional property of being
convex on at least one of the trees is (predictably) called an optimal convex character
(for this pair of trees).


3 Result 

We recall the following earlier result, proven in [5, Theorem 3.6] and [8, Observation 6.1]:

Theorem 3.1. [5, 8] Any pair $(T_1,T_2)$ of $X$-trees admits an optimal convex character
with at most $\lfloor |X|/2 \rfloor$ states.

Our main result is the following new bound which is independent of $|X|$. This is
particularly advantageous when $d_{MP}$ is small and $|X|$ is large.

Bounded States Theorem. Any pair $(T_1,T_2)$ of $X$-trees admits an optimal convex
character with at most $7 \cdot d_{MP}(T_1,T_2) - 5$ states.

We prove this theorem below, but first we need to introduce some more
concepts and lemmas in the following two sections.


3.1 The forest induced by a character extension 


In this section we define the forest $F$ induced by an extension $\bar\chi$ (of a character $\chi$ to an
$X$-tree $T$); this construction will be extensively used in the proof of the Bounded States
Theorem.

Let us assume that $\bar\chi$ creates $(p-1)$ mutations in $T$. If we delete all these mutation
edges, we are left with a forest $F$ having $p$ connected components. Each of these
components is a subtree of $T$, whose vertices all share a common character state (assigned by
$\bar\chi$). We then say that two components of $F$ are adjacent if the two corresponding subtrees
of $T$ are connected by one mutation edge (they cannot be connected by more than one
mutation edge, since there are no cycles in $T$). This yields a graph structure $G(F)$ where
the vertices are the components of $F$ and the edges are the (unordered) pairs of adjacent
components, which can be identified with the mutation edges of $T$. $G(F)$ has $p$ vertices
and $(p-1)$ edges, and must be connected since $T$ is connected: therefore $G(F)$ can be
seen as a tree in its own right. Figure 3.1 gives a concrete example of such an induced
forest.
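The construction above can be sketched in a few lines: delete the mutation edges, collect the connected components, and record one edge of $G(F)$ per mutation edge. The function name `induced_forest`, the adjacency-dictionary representation and the small example extension are assumptions made for illustration.

```python
def induced_forest(adj, ext):
    """Return (component_of, number_of_components, edges_of_G).

    adj: dict mapping each vertex of a tree to a list of its neighbours.
    ext: dict mapping every vertex (leaves and inner vertices) to a state.
    """
    component_of = {}
    n_components = 0
    for start in adj:
        if start in component_of:
            continue
        # Flood-fill along non-mutation edges (both endpoints share a state).
        component_of[start] = n_components
        stack = [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in component_of and ext[w] == ext[v]:
                    component_of[w] = n_components
                    stack.append(w)
        n_components += 1
    # Each mutation edge joins two different components: one edge of G(F).
    g_edges = set()
    for v in adj:
        for w in adj[v]:
            if ext[v] != ext[w]:
                g_edges.add(frozenset((component_of[v], component_of[w])))
    return component_of, n_components, g_edges


# Hypothetical quartet with an extension that creates two mutations.
quartet = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
           "u": ["a", "b", "v"], "v": ["u", "c", "d"]}
extension = {"a": "A", "b": "B", "c": "A", "d": "B", "u": "A", "v": "A"}
component_of, p, g_edges = induced_forest(quartet, extension)
print(p, len(g_edges))
```

In this example there are $p = 3$ components and $p - 1 = 2$ edges in $G(F)$, as the text predicts.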




Figure 3.1: (a) The forest $F$ induced by a most parsimonious extension $\bar\chi$ of the character
$\chi$ = (CBCBDDBDAEEABABC) on an $X$-tree with leaves labeled from 1 to 16, and (b) its graph
structure $G(F)$. States B and C are repeating states, while all others are unique states.

When $\bar\chi$ is a most parsimonious extension, each component of the forest must contain
at least one leaf of $T$. This in turn implies that a most parsimonious extension never
introduces redundant states, i.e. states that were not in the original character. Also, keep
in mind that the forest (and its tree structure) depends on the choice of the extension $\bar\chi$:
even two different most parsimonious extensions may yield different induced forests. We
conclude this section with some useful terminology and related lemmas.

Definition 3.1. Let $F$ be the forest induced by a most parsimonious extension $\bar\chi$ of a
character $\chi$. Let $C$ be the set of states used by $\bar\chi$ (which will be equal to the set of states
used by $\chi$). We can distinguish between different kinds of states and components:

• a state of $\bar\chi$ is unique if it is assigned to exactly one component of $F$,

• a state of $\bar\chi$ is repeating if it is assigned to at least two components of $F$,

• a component of $F$ is unique if its assigned state is a unique state of $\bar\chi$,

• a component of $F$ is repeating if its assigned state is a repeating state of $\bar\chi$.

Note that each state is either unique or repeating, but not both.

The following lemma gives useful bounds on the numbers of unique or repeating states
and components for a given induced forest.

Lemma 3.1. Let $F$ be the forest induced by any most parsimonious extension $\bar\chi$ of any
character $\chi : X \to C$ to any $X$-tree $T$. The total number of components in $F$ is $|\chi| + h =
\ell(T,\chi) + 1$, where $h := h(T,\chi)$ is the homoplasy score of $\chi$ on $T$. Then the following
inequalities are satisfied.

$$\begin{array}{rcccl}
|\chi| - h & \le & \text{number of unique states} & \le & |\chi| \\
0 & \le & \text{number of repeating states} & \le & h \\
|\chi| - h & \le & \text{number of unique components} & \le & |\chi| \\
h & \le & \text{number of repeating components} & \le & 2h
\end{array}$$

Furthermore, $\chi$ is convex $\iff$ $h = 0$ $\iff$ all states and components are unique.

Proof. Let us partition $C$ into two sets $C_U$ and $C_R$, respectively containing the unique
states and the repeating states. The set of components in $F$ is similarly split into two
sets $F_U$ and $F_R$. Clearly, we have $|C_U| + |C_R| = |\chi|$ and $|F_U| + |F_R| = |\chi| + h$.

Now, according to Definition 3.1, a state is repeating if it is assigned to at least two
(repeating) components of $F$, and every component has exactly one state assigned to
it, so we must have $2|C_R| \le |F_R|$. It is also clear that $|C_U| = |F_U|$, because there is a
one-to-one correspondence between unique states and unique components. Using these
two observations and the two preceding equalities, we find:

$$\begin{aligned}
|C_U| + 2|C_R| &\le |F_U| + |F_R| \\
|\chi| + |C_R| &\le |\chi| + h
\end{aligned}$$

Then canceling the $|\chi|$ term on both sides and combining with the obvious bound $0 \le |C_R|$
gives the second inequality of the lemma, which in turn leads to the other three:

$$\begin{array}{rcccll}
0 & \le & |C_R| & \le & h & \text{(2nd inequality)} \\
0 & \le & |\chi| - |C_U| & \le & h & \\
-h & \le & |C_U| - |\chi| & \le & 0 & \\
|\chi| - h & \le & |C_U| & \le & |\chi| & \text{(1st inequality)} \\
|\chi| - h & \le & |F_U| & \le & |\chi| & \text{(3rd inequality)} \\
|\chi| - h & \le & |\chi| + h - |F_R| & \le & |\chi| & \\
-|\chi| & \le & |F_R| - |\chi| - h & \le & h - |\chi| & \\
h & \le & |F_R| & \le & 2h & \text{(4th inequality)}
\end{array}$$

Moreover, if $h = 0$, with the 1st inequality we get $|C_U| = |\chi|$, and with the 3rd
inequality we get $|F_U| = |\chi|$, which implies that all states and all components are unique.
On the other hand, if all states and components are unique, we have $|F_R| = 0$, which
leads to $h = 0$ by the 4th inequality. This completes the proof. □
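The bookkeeping of Definition 3.1 and Lemma 3.1 can be checked mechanically once the state of each component is known. The sketch below takes a list with one state per component of $F$; the list used in the example is hypothetical (it is not the forest of Figure 3.1, whose component structure is not recoverable from the text), and the function name `classify` is likewise an assumption.

```python
from collections import Counter


def classify(component_states):
    """Split states and components into unique / repeating (Definition 3.1).

    component_states: list with the state assigned to each component of F.
    Returns (unique_states, repeating_states, n_unique_comp, n_repeating_comp).
    """
    counts = Counter(component_states)
    unique_states = {s for s, c in counts.items() if c == 1}
    repeating_states = {s for s, c in counts.items() if c >= 2}
    n_unique_comp = sum(1 for s in component_states if s in unique_states)
    n_repeating_comp = len(component_states) - n_unique_comp
    return unique_states, repeating_states, n_unique_comp, n_repeating_comp


# Hypothetical forest: B labels three components, C labels two, A, D, E one each.
component_states = ["A", "B", "B", "B", "C", "C", "D", "E"]
uniq, rep, n_uc, n_rc = classify(component_states)

n_states = len(set(component_states))   # |chi|
h = len(component_states) - n_states    # number of components is |chi| + h

# The four inequalities of Lemma 3.1.
checks = [
    n_states - h <= len(uniq) <= n_states,
    0 <= len(rep) <= h,
    n_states - h <= n_uc <= n_states,
    h <= n_rc <= 2 * h,
]
print(sorted(uniq), sorted(rep), h, checks)
```

Here $|\chi| = 5$ and $h = 3$, with three unique states and five repeating components, so all four inequalities hold.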


3.2 Relabeling states and sufficient conditions for the existence 
of “good” pairs of states 

Here relabeling the states of a given character $\chi : X \to C$ simply means composing it with
some surjection $\varphi : C \to C'$ in order to produce a new character $\chi' = \varphi \circ \chi : X \to C'$.
Clearly, $|\chi'| \le |\chi|$ and $\ell(T,\chi') \le \ell(T,\chi)$ for every $X$-tree $T$. The proof of the Bounded
States Theorem is based on a relabeling argument in which only one state of the character
is relabeled, i.e. $\varphi(A) = B$ for two states $A, B \in C$ and $\varphi$ is the identity on states
other than $A$. The high-level idea is to show that, whenever an optimal convex character
exists with more than $7 d_{MP}(T_1,T_2) - 5$ states, it will always be possible to find two states
$A$ and $B$ such that relabeling $A$ as $B$ causes the parsimony score of both trees to decrease
by exactly one. That is, a new optimal convex character with fewer states can be found,
and the theorem will follow.

Let $(T_1,T_2)$ be a pair of $X$-trees and let $\chi$ be an optimal convex character for this
pair. Without loss of generality, let $\chi$ be convex on $T_1$. Let $\bar\chi_1$ be a most parsimonious
extension of $\chi$ to $T_1$ and $\bar\chi_2$ a most parsimonious extension of $\chi$ to $T_2$. Let $F_1$ and $F_2$ be
the forests induced by $\bar\chi_1$ and $\bar\chi_2$ respectively. We say that two components $A$ and $B$ are
$F_i$-adjacent if they are adjacent in the forest $F_i$. (Note that if a state is unique, or we are
focusing on $F_1$, the terms “state” and “component” can be used interchangeably.)

Observation 3.1. Let $A$ and $B$ be two distinct states that are $F_1$-adjacent. Let $\chi'$ be
the new character obtained by relabeling $A := B$. Then $\chi'$ is a convex character. In
particular, $\ell(T_1,\chi') = \ell(T_1,\chi) - 1$ and $\chi'$ uses exactly one fewer state than $\chi$. Moreover,
if $\ell(T_2,\chi') \ge \ell(T_2,\chi) - 1$, then $\chi'$ is an optimal convex character (that uses exactly one
fewer state than $\chi$).

Proof. Relabeling $A := B$ within the extension $\bar\chi_1$ yields an extension $\bar\chi_1'$ (of $\chi'$) such
that $|\Delta(T_1,\bar\chi_1')| \le |\Delta(T_1,\bar\chi_1)| - 1$. This is because a mutation is saved on the edge
generating the adjacency between $A$ and $B$. Hence, $\ell(T_1,\chi') \le \ell(T_1,\chi) - 1$. Given
that $|\chi'| = |\chi| - 1$, and the natural lower bound $\ell(T_1,\chi') \ge |\chi'| - 1$, it follows that
$\ell(T_1,\chi') \ge |\chi'| - 1 = |\chi| - 2 = \ell(T_1,\chi) - 1$, and the convexity of $\chi'$ follows. If, additionally,
$\ell(T_2,\chi') \ge \ell(T_2,\chi) - 1$, then the optimality of $\chi'$ is immediate, since in that case
$\ell(T_2,\chi') - \ell(T_1,\chi') \ge \ell(T_2,\chi) - \ell(T_1,\chi) = d_{MP}(T_1,T_2)$. □

We are thus interested in identifying states $A$ and $B$ with the following property: $A$
and $B$ are $F_1$-adjacent and $\ell(T_2,\chi') \ge \ell(T_2,\chi) - 1$, where $\chi'$ is obtained by taking $A := B$.
We call such a pair of states a good pair.

Given an $X$-tree $T$ and an edge $e$ of $T$, deleting $e$ breaks $T$ into two connected
components and this naturally induces a bipartition $P|Q$ of $X$. We then say that $P|Q$ is
the split generated in $T$ by $e$.

Lemma 3.2. Let $A$ and $B$ be two distinct states that are $F_1$-adjacent and let $X_A, X_B \subseteq X$
be the taxa that are labeled with $A$, $B$ respectively. Suppose that in $T_2$ there exists an edge
$e$ that generates a split $P|Q$, where $X_A \subseteq P$ and $X_B \subseteq Q$. Then $(A, B)$ is a good pair.

Proof. It is sufficient to prove $\ell(T_2,\chi') \ge \ell(T_2,\chi) - 1$. Suppose, for the sake of
contradiction, that $\ell(T_2,\chi') \le \ell(T_2,\chi) - 2$. Let $\bar\chi_2'$ be a most parsimonious extension of $\chi'$ to
$T_2$. Deleting $e$ from $T_2$ breaks $V(T_2)$ into two connected components $V_A$ and $V_B$, one
containing all taxa $X_A$ and the other containing all taxa $X_B$. (Note that here $X_A$, $X_B$
refer to the taxa that were labeled $A$ and $B$ before the relabeling.) We adjust $\bar\chi_2'$ as
follows: every vertex that is in $V_A$ and labeled with state $B$ is switched to state $A$. This
yields an extension $\bar\chi$ of $\chi$ to $T_2$ such that $|\Delta(T_2,\bar\chi)| \le |\Delta(T_2,\bar\chi_2')| + 1$. This is
because the only new mutation that can be created is on the edge $e$. However, this implies
$\ell(T_2,\chi) \le |\Delta(T_2,\bar\chi)| \le |\Delta(T_2,\bar\chi_2')| + 1 = \ell(T_2,\chi') + 1 \le (\ell(T_2,\chi) - 2) + 1 < \ell(T_2,\chi)$,
yielding a contradiction. □
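The hypothesis of Lemma 3.2 (some edge of $T_2$ puts all $A$-taxa on one side and all $B$-taxa on the other) is easy to test. A minimal sketch, assuming the tree is given as an adjacency dictionary and the two taxon sets are non-empty and disjoint; the function name `separating_edge` is hypothetical.

```python
def separating_edge(adj, taxa_a, taxa_b):
    """Return an edge (p, v) whose split separates taxa_a from taxa_b, else None."""
    root = next(iter(adj))
    parent, order = {root: None}, [root]
    for v in order:  # breadth-first: parents appear before children
        for w in adj[v]:
            if w != parent[v]:
                parent[w] = v
                order.append(w)
    # below[v] = [number of A-taxa, number of B-taxa] in the subtree hanging at v
    below = {v: [int(v in taxa_a), int(v in taxa_b)] for v in adj}
    for v in reversed(order):  # children are processed before their parent
        p = parent[v]
        if p is not None:
            below[p][0] += below[v][0]
            below[p][1] += below[v][1]
    n_a, n_b = len(taxa_a), len(taxa_b)
    for v in order:
        if parent[v] is None:
            continue
        a, b = below[v]
        # Edge {parent[v], v}: one side holds 'a' A-taxa and 'b' B-taxa.
        if (a == n_a and b == 0) or (a == 0 and b == n_b):
            return (parent[v], v)
    return None


# Hypothetical quartet with cherries {a, b} and {c, d}.
quartet = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
           "u": ["a", "b", "v"], "v": ["u", "c", "d"]}
print(separating_edge(quartet, {"a", "b"}, {"c", "d"}))
print(separating_edge(quartet, {"a", "c"}, {"b", "d"}))
```

The first call finds the central edge; the second returns `None` because no single edge separates {a, c} from {b, d} in this tree.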

Recall the definitions of unique and repeating from earlier. We emphasise that here
we classify states as unique or repeating with reference to $F_2$ (which is induced by $\bar\chi_2$).

Observation 3.2. Let $A$ and $B$ be two distinct states that are $F_1$-adjacent and let $A$ be a
unique state. Let $X_A, X_B \subseteq X$ be the taxa that are labeled with $A$, $B$ respectively. Suppose
that in $T_2$ there exists an edge $e$ that generates a split $X_A \,|\, X \setminus X_A$ (i.e. the taxa $X_A$ form a
“pendant subtree” in $T_2$). Then $(A, B)$ is a good pair.

Observation 3.3. Let $A$ and $B$ be two distinct states that are $F_1$-adjacent and such that
both are unique. Then $(A, B)$ is a good pair.


Proof. Observation 3.2 is immediate from Lemma 3.2. Observation 3.3 is slightly more
subtle. The point here is that if a state $U$ is unique then in $T_2$ all the vertices allocated
state $U$ (by extension $\bar\chi_2$) form a single connected subgraph. In particular this applies to
both $A$ and $B$. Given that these two states are necessarily distinct, any simple path in $T_2$
between these two connected subgraphs must pass through some edge in $\Delta(T_2,\bar\chi_2)$, and
this edge generates a split with all the $A$ taxa on one side and all the $B$ taxa on the other,
so Lemma 3.2 applies. □


See Figure 3.2 for an example where Observations 3.2 and 3.3 may be used to decrease
the number of character states.

[Figure 3.2, three panels:
(a) Before any relabeling. $\chi$ = (CBCBDDBDAEEABABC)
(b) After relabeling E := A. $\chi'$ = (CBCBDDBDAAAABABC)
(c) After relabeling D := A. $\chi''$ = (CBCBAABAAAAABABC)]

Figure 3.2: Successive applications of Observations 3.2 and 3.3 to decrease the number of
states used by an optimal convex character. Only the second forests ($F_2$ and its subsequent
transformations), along with their corresponding graph structures, are shown in these figures.

(a) The original $F_2$ forest before any relabeling of the states of the character $\chi$. The state E
is unique and its component in $F_2$ is a pendant subtree. Assuming that E is $F_1$-adjacent
to A, Observation 3.2 applies and we may relabel E := A. This gives a new optimal convex
character $\chi'$ which does not use the state E anymore.

(b) The forest $F_2'$ induced by a most parsimonious extension $\bar\chi_2'$ of $\chi'$ to $T_2$ (note that this
is not the only possibility: another $\bar\chi_2'$ could induce another $F_2'$). States A and D are both
unique in $F_2'$. Assuming $F_1'$-adjacency (where $F_1'$ is induced by some $\bar\chi_1'$), Observation 3.3
applies and we may relabel D := A. This gives yet another optimal convex character $\chi''$.

(c) The forest $F_2''$ induced by a most parsimonious extension $\bar\chi_2''$ of $\chi''$ to $T_2$. Only three
states A, B, and C are used by $\chi''$, compared to five states in the original character $\chi$.
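The two relabeling steps of Figure 3.2 can be reproduced on the character strings alone. This is only a sketch of the relabeling operation itself: the function name `relabel` is an assumption, and the code does not check the adjacency or good-pair conditions, which depend on the trees shown in the figure.

```python
def relabel(character, old, new):
    """Relabel state `old` as `new` in a character written as a string of states."""
    return character.replace(old, new)


chi = "CBCBDDBDAEEABABC"                  # panel (a) of Figure 3.2
chi_prime = relabel(chi, "E", "A")        # E := A, panel (b)
chi_double = relabel(chi_prime, "D", "A") # D := A, panel (c)

for c in (chi, chi_prime, chi_double):
    print(c, len(set(c)))
```

Each step removes exactly one state, going from five states to four and then to three.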


Lemma 3.3. Let $A$ and $B$ be two distinct states that are $F_1$-adjacent where $A$ is a unique
state. Assume the situation described in Observation 3.2 does not hold, i.e. there is no
edge $e$ which generates a split $X_A \,|\, X \setminus X_A$ in $T_2$. If there exists a unique state $C \neq A$ such that
$A$ and $C$ are $F_2$-adjacent and both of degree 2 in $G(F_2)$, then $(A, B)$ is a good pair.


Proof. If $A$ and $B$ are both unique then we are done, by Observation 3.3. Hence we may
assume that $B$ is a repeating state, i.e. there are at least 2 components in $F_2$ that have state
$B$. Let $V_A, V_C \subseteq V(T_2)$ be those vertices of $T_2$ that are allocated state $A$, $C$ (respectively)
by $\bar\chi_2$. Let $X_A, X_C \subseteq X$ be defined similarly for taxa. We have $|X_A|, |X_C| \ge 2$
because otherwise the situation in Observation 3.2 would trivially apply.

Let $e_{AC} \in \Delta(T_2,\bar\chi_2)$ be the edge of $T_2$ that defines the adjacency between $A$ and $C$
in $F_2$. Let $e_A \in \Delta(T_2,\bar\chi_2)$ be the edge of $T_2$ that defines the adjacency between $A$ and
its other neighbouring component in $F_2$. Define $e_C$ correspondingly for state $C$. These
three edges are uniquely defined and have no endpoints in common. This is because of
the assumption that Observation 3.2 does not apply, the fact that $T_2$ is a binary tree,
and the degree 2 restriction. See Figure 3.3 (top subfigure) for a schematic depiction of
the situation.


Observe that, if $P$ is any simple path (in $T_2$) from a taxon in $X_A$ to a taxon in $X_B$,
then exactly one of the following two situations must hold: (1) $P$ traverses edge $e_A$;
(2) $P$ traverses both edges $e_{AC}$ and $e_C$. This, again, is a consequence of the degree 2
assumption. We will use this insight in due course.

As usual let $\chi'$ be the character obtained by relabeling $A := B$ within $\chi$. (We emphasize
that $V_A$, $V_C$, $X_A$, $X_B$, $X_C$ are defined before the relabeling.) Assume, again for the sake
of contradiction, that $\ell(T_2,\chi') \le \ell(T_2,\chi) - 2$. Let $\bar\chi_2'$ be a most parsimonious extension
of $\chi'$ to $T_2$. We say that $\bar\chi_2'$ is left merging if, in $\bar\chi_2'$, there is a simple path $P$ from some
taxon in $X_A$ to some taxon in $X_B$ such that all vertices on $P$ are allocated state $B$ by $\bar\chi_2'$
and $P$ traverses edge $e_A$. We say that $\bar\chi_2'$ is right merging if, in $\bar\chi_2'$, there is a simple path
$P$ from some taxon in $X_A$ to some taxon in $X_B$ such that all vertices on $P$ are allocated
state $B$ by $\bar\chi_2'$ and $P$ traverses both edges $e_{AC}$ and $e_C$. Note that $\bar\chi_2'$ might be left
merging, right merging, both or neither. Depending on the exact combination, we use a
different relabeling strategy.



Figure 3.3: Top: the situation described in Lemma 3.3 ($\bar\chi_2$ satisfies the lemma
requirements). Bottom: the fourth case in the proof of that lemma ($\bar\chi_2'$ is both left merging
and right merging).


The simplest case is when $\bar\chi_2'$ is neither left merging nor right merging. In this case,
consider the subgraph of $T_2$ induced by vertices that are allocated state $B$ by $\bar\chi_2'$. In general
this subgraph might be disconnected. Delete all connected components of the subgraph
that do not contain at least one taxon from $X_A$. Now, let $V'$ be the vertices that remain.
We create an extension $\bar\chi$ of $\chi$ from $\bar\chi_2'$ by relabeling all vertices in $V'$ to state $A$, and leaving
the other vertices untouched. (There is no danger that a taxon in $X_B$ will be labeled with
state $A$ because that would mean $\bar\chi_2'$ was left and/or right merging, which we exclude by
assumption.) Given that $X_A$ will by construction be a subset of $V'$, $\bar\chi$ is indeed a valid
extension of $\chi$. Moreover, $\Delta(T_2,\bar\chi) = \Delta(T_2,\bar\chi_2')$. This is because, due to the fact that $\bar\chi_2'$
is neither left nor right merging, the transformation of $\bar\chi_2'$ into $\bar\chi$ cannot create any new
mutations. This then gives $\ell(T_2,\chi) \le |\Delta(T_2,\bar\chi)| = |\Delta(T_2,\bar\chi_2')| = \ell(T_2,\chi') \le \ell(T_2,\chi) - 2$,
and we have our desired contradiction.


If $\bar\chi_2'$ is left merging but not right merging, consider the subgraph of $T_2$ induced by
vertices that are allocated state $B$ by $\bar\chi_2'$. Delete edge $e_A$ from the subgraph. (It will
definitely be in the subgraph because $\bar\chi_2'$ is left merging.) Next delete all connected
components of the subgraph that do not contain at least one taxon from $X_A$. As above,
transform $\bar\chi_2'$ into $\bar\chi$, an extension of $\chi$, by relabeling all the surviving vertices from $B$ to $A$.
The transformation can increase the number of mutations by at most 1: on the edge
$e_A$. Hence $\ell(T_2,\chi) \le |\Delta(T_2,\bar\chi)| \le |\Delta(T_2,\bar\chi_2')| + 1 = \ell(T_2,\chi') + 1 \le (\ell(T_2,\chi) - 2) + 1 =
\ell(T_2,\chi) - 1$, and we again have a contradiction.

If $\bar\chi_2'$ is right merging but not left merging, we do exactly the same as in the previous
paragraph, except that we delete $e_{AC}$ instead of $e_A$. This again yields the contradiction
$\ell(T_2,\chi) \le \ell(T_2,\chi) - 1$.


The final, and most complicated, case is when $\bar\chi_2'$ is both left merging and right
merging (see Figure 3.3, bottom subfigure). Here we convert $\bar\chi_2'$ into $\bar\chi$ as follows: all
vertices in $V_A$ are switched to state $A$, and all vertices in $V_C$ are switched to state $C$.
This can create a new mutation on edge $e_A$. (The relabeling might cause some mutations
inside $V_A$ to disappear, which can only help us, but for the sake of the proof we shall
not assume this advantage exists.) The relabeling can also create new mutations on $e_{AC}$
and $e_C$. However, these two mutations are compensated for by the disappearance of at
least two mutations inside $V_C$. The argument is as follows. Clearly, $C \neq B$ because $C$ is
unique. The fact that $\bar\chi_2'$ is right merging means that (in $\bar\chi_2'$) it is possible to walk along
a simple path from some taxon in $X_A$ to some taxon in $X_B$, such that every vertex in the
path has state $B$, and the path traverses $e_{AC}$ and $e_C$. Recall that $|X_C| \ge 2$ and $C$ was not
“pendant” in $\bar\chi_2$ (due to the assumption that Observation 3.2 does not hold). Hence in
$\bar\chi_2'$ there are at least two mutations of the form $B$-$C$ on the set of edges whose endpoints
are completely contained inside $V_C$. It is precisely these mutations that disappear when
we completely relabel $V_C$ to state $C$. Due to this compensation effect the total increase
in the number of mutations when transforming $\bar\chi_2'$ into $\bar\chi$ is at most 1. This yields the by
now familiar conclusion $\ell(T_2,\chi) \le \ell(T_2,\chi) - 1$, and thus a contradiction. □


3.3 The bounding function 


In this final section we show that, whenever an optimal convex character exists with
strictly more than $7 d_{MP}(T_1,T_2) - 5$ states, then a good pair of states will definitely exist,
allowing us to reduce the number of states in the character whilst preserving optimality
and convexity. This will complete the proof of the Bounded States Theorem.
In particular, we will show that at least one of the situations described in Lemma 3.3,
Observation 3.2 and Observation 3.3 will hold. To begin we need an auxiliary lemma.


Lemma 3.4. Let $T = (V,E)$ be a (not necessarily phylogenetic) tree in which $V$ is
partitioned into a set $R$ of red vertices and a set $B$ of blue vertices and all leaves of $T$
are red. If $|B| \ge 3|R| - 4$, then there exist two adjacent vertices $u_1 \neq u_2$ both of which
are blue and of degree 2.

Proof. Suppose for the sake of contradiction that this is not true. Let $T$ be a
counterexample: all its leaves are red, and $|B| \ge 3|R| - 4$, but two vertices with the described
property (henceforth called a “$(u_1, u_2)$ pair”) do not exist. Now, suppose $T$ has an internal
vertex $v$ that is red. We introduce a new vertex $v'$, attach it by an edge to $v$, colour $v'$
red and colour $v$ blue. This increases the number of blue vertices by one and preserves
the number of red vertices. Moreover, due to the fact that $v$ now has degree at least
3, this operation cannot cause a $(u_1, u_2)$ pair to arise. Hence, this new tree is also a
counterexample. We repeat this until we obtain a tree $T'$ whose leaves are all red and
whose internal vertices are all blue. Let $R'$ and $B'$ be the sets of red and blue vertices of
$T'$. By the previous argument, $|B'| \ge 3|R'| - 4$. Now, if one suppresses all vertices in $T'$
of degree 2, we obtain a tree $T''$ on $|R'|$ leaves with at most $|R'| - 2$ internal vertices and
at most $2|R'| - 3$ edges (note that these values correspond to the binary case). Since $T'$
contains no $(u_1, u_2)$ pair, we can obtain $T'$ from $T''$ by subdividing each edge of $T''$ at most
once. Hence,

$$|B'| \le |R'| - 2 + (2|R'| - 3) = 3|R'| - 5$$

and this yields a contradiction. □
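Lemma 3.4 is easy to test on small coloured trees. The sketch below searches for two adjacent blue vertices of degree 2; the function name `blue_degree2_pair` and both example trees are hypothetical. The second example, a spider with a blue centre and three subdivided legs, has $|B| = 3|R| - 5$ and no such pair, which suggests the threshold $3|R| - 4$ (read as a non-strict inequality) cannot be lowered in general.

```python
def blue_degree2_pair(adj, red):
    """Return adjacent blue degree-2 vertices (u1, u2), or None if there are none.

    adj: dict mapping each vertex of a tree to a list of its neighbours.
    red: set of red vertices; every other vertex counts as blue.
    """
    for v in adj:
        if v in red or len(adj[v]) != 2:
            continue
        for w in adj[v]:
            if w not in red and len(adj[w]) == 2:
                return (v, w)
    return None


def satisfies_hypothesis(adj, red):
    """Check that all leaves are red and |B| >= 3|R| - 4."""
    leaves_red = all(v in red for v in adj if len(adj[v]) == 1)
    n_blue = len(adj) - len(red)
    return leaves_red and n_blue >= 3 * len(red) - 4


# Path r1 - b1 - b2 - r2: |R| = 2, |B| = 2 = 3*2 - 4, so a pair must exist.
path = {"r1": ["b1"], "b1": ["r1", "b2"], "b2": ["b1", "r2"], "r2": ["b2"]}
path_red = {"r1", "r2"}

# Spider: blue centre c, legs c - b_i - r_i. |R| = 3, |B| = 4 = 3*3 - 5.
spider = {"c": ["b1", "b2", "b3"],
          "b1": ["c", "r1"], "b2": ["c", "r2"], "b3": ["c", "r3"],
          "r1": ["b1"], "r2": ["b2"], "r3": ["b3"]}
spider_red = {"r1", "r2", "r3"}

print(satisfies_hypothesis(path, path_red), blue_degree2_pair(path, path_red))
print(satisfies_hypothesis(spider, spider_red), blue_degree2_pair(spider, spider_red))
```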


Now, let $\chi$, $\bar\chi_1$, $\bar\chi_2$, $F_1$, $F_2$, $G(F_2)$ be defined as at the beginning of the previous section,
and let $\chi$ use strictly more than $7 d_{MP} - 5$ (i.e. at least $7 d_{MP} - 4$) states, where here we
write $d_{MP}$ as short for $d_{MP}(T_1,T_2)$. If Observation 3.2 or Observation 3.3 holds then we
are done. Otherwise, consider the following: $\chi$ is convex on $T_1$, so it achieves a parsimony score
exactly equal to $|\chi| - 1$ there. On $T_2$ it achieves a parsimony score exactly equal to $|\chi| - 1 + d_{MP}$, so
the homoplasy score $h$ of $\chi$ on $T_2$ is exactly $d_{MP}$. Then, by Lemma 3.1 (1st inequality) there are
at least $|\chi| - d_{MP} \ge 6 d_{MP} - 4$ unique states and at most $2 d_{MP}$ (4th inequality) repeating
components (in $F_2$). We know that, because Observation 3.2 does not hold, none of the
leaves of $G(F_2)$ are unique states. In particular, all the leaves of $G(F_2)$ are repeating
components. Now, if we view repeating components as “red” vertices in Lemma 3.4 and
unique states as “blue”, we need $6 d_{MP} - 4 \ge 3(2 d_{MP}) - 4$ to be able to use Lemma 3.4.
This holds, so we are done: in particular, Lemma 3.4 shows the existence of a good pair
via the situation described in Lemma 3.3.
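The counting step can be written out as a short chain. This is only a restatement of the argument above, under the reading that the hypothesis of Lemma 3.4 is the non-strict inequality $|B| \ge 3|R| - 4$; here $R$ and $B$ denote the repeating and unique components of $F_2$, viewed as the red and blue vertices of $G(F_2)$, and $h = d_{MP}$.

$$\begin{aligned}
|\chi| &\ge 7 d_{MP} - 4, \\
|B| &\ge |\chi| - h \ge (7 d_{MP} - 4) - d_{MP} = 6 d_{MP} - 4, \\
|R| &\le 2h = 2 d_{MP}, \\
3|R| - 4 &\le 6 d_{MP} - 4 \le |B|.
\end{aligned}$$

With one state fewer, i.e. $|\chi| = 7 d_{MP} - 5$, the same chain only gives $|B| \ge 6 d_{MP} - 5$, which is why the argument stops at $7 d_{MP} - 5$ states.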
4 Discussion


The bound $7 d_{MP} - 5$ is sharp for the case $d_{MP} = 1$: clearly at least 2 states are needed to
achieve a distance of 1 or more. For $d_{MP} \ge 2$ there is probably room to improve the bound,
and this is an interesting direction for future research. For $d_{MP} = 2$ a slight generalization
of the arguments used in the proof of Lemma 3.3, combined with an ad-hoc case analysis,
can be used to easily reduce the bound from 9 to 7. Increasingly complex arguments
can be utilized to reduce this further: we conjecture that 3 states are actually sufficient
when $d_{MP} = 2$. These arguments do not easily lead to any significant improvement in the
general $7 d_{MP} - 5$ bound and are not included here. However, they raise the intriguing
(although somewhat speculative) question of whether $d_{MP} + 1$ states are always sufficient;
the example given later in this section shows that they are sometimes necessary.

From an algorithmic perspective the bound has the following implications. If $k$ is a
verified upper bound on $d_{MP}$, then we can guarantee to find an optimal (convex) character
achieving $d_{MP}$ simply by guessing which of $T_1$ and $T_2$ is convex and then looping through
all at most

$$\sum_{i=2}^{7k-5} \binom{2|X|-3}{i-1}$$

convex characters with at most $7k - 5$ states. This is because a convex character with
$i$ states corresponds to a size $(i - 1)$ subset of the edges in the convex tree, and an
unrooted tree on $|X|$ taxa has at most $2|X| - 3$ edges. Clearly, for constant $k$ this yields
a running time polynomial in $|X|$. (Prior to the Bounded States Theorem a constant
upper bound of $k$ states yielded only running times of the form $O(k^{|X|})$: there are many
more non-convex than convex characters on $k$ states.) However, the bound does not
automatically mean that questions such as “Is $d_{MP} \le t$?” or “Is $d_{MP} \ge t$?” can be
answered in polynomial time for fixed, constant $t$. This is because in its current form the
Bounded States Theorem only holds for optimal characters: if we apply it to suboptimal
characters we can still decrease the number of states by merging good pairs of states,
but the parsimony distance achieved by the new character might increase compared to
the old character. Expressed differently, the danger exists that for some values $d < d_{MP}$,
all convex characters achieving parsimony distance exactly $d$ will have a huge number of
states. This means that the obvious algorithmic strategy, of looping through all convex
characters with an increasing number of states, does not have a clear stopping criterion,
even for $t$ fixed.
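To get a feeling for the size of this search space, the sum of binomial coefficients above can be evaluated directly. A minimal sketch; the function name is hypothetical, and the count is for one choice of the convex tree (guessing between $T_1$ and $T_2$ doubles it).

```python
from math import comb


def num_candidate_edge_subsets(n_taxa, k):
    """Sum over i = 2..7k-5 of C(2*n_taxa - 3, i - 1).

    n_taxa: number of taxa |X|; k: a verified upper bound on the distance.
    Each term counts the (i-1)-subsets of the at most 2|X|-3 edges of one tree.
    """
    n_edges = 2 * n_taxa - 3
    return sum(comb(n_edges, i - 1) for i in range(2, 7 * k - 5 + 1))


for n_taxa in (4, 20, 100):
    print(n_taxa, num_candidate_edge_subsets(n_taxa, 1), num_candidate_edge_subsets(n_taxa, 2))
```

For $k = 1$ the sum has a single term and equals the number of edges $2|X| - 3$; for $k = 2$ it runs over edge subsets of size 1 to 8, which is polynomial in $|X|$ but already large.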

Finally, we remark that optimal non-convex characters might have strictly fewer states
than optimal convex characters. In the proof of Lemma 3.7 of [5] the following two trees
are shown which have $d_{MP} = 2$:

(((((((1,2),3),4),5),6),7),8);

(((1,3),(2,4)),((5,7),(6,8)));

(The fact that $d_{MP} = 2$ is not proven there, but it can be easily verified computationally.)
The proof there shows that 2 states are sufficient to achieve this maximum if non-convex
characters are allowed, but 3 are needed if we restrict to convex characters. It is natural to ask how
far apart, in general, the minimum number of required states can be.
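The two-state claim can be illustrated with Fitch's algorithm on the two trees above, written as nested pairs that mirror the Newick strings (rooting an unrooted binary tree on an edge does not change the parsimony score). The alternating character used below is an illustrative choice and is not necessarily the one used in the cited proof.

```python
def fitch(tree, chi):
    """Fitch's algorithm on a rooted binary tree given as nested 2-tuples.

    Returns (candidate state set at the root, parsimony score).
    """
    if not isinstance(tree, tuple):  # a leaf, i.e. a taxon
        return {chi[tree]}, 0
    left_states, left_cost = fitch(tree[0], chi)
    right_states, right_cost = fitch(tree[1], chi)
    common = left_states & right_states
    if common:  # the children can agree: no extra mutation
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


t1 = (((((((1, 2), 3), 4), 5), 6), 7), 8)
t2 = (((1, 3), (2, 4)), ((5, 7), (6, 8)))

# Two-state character A B A B A B A B on taxa 1..8 (illustrative choice).
chi = {taxon: "AB"[(taxon - 1) % 2] for taxon in range(1, 9)}

score1 = fitch(t1, chi)[1]
score2 = fitch(t2, chi)[1]
print(score1, score2, abs(score1 - score2))
```

This character scores 4 on the first tree and 2 on the second, so it achieves distance 2 with two states while being convex on neither tree (a two-state convex character would score 1).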